Semi-supervised learning with committees: exploiting unlabeled data using ensemble learning algorithms

Author

  • Mohamed Farouk Abdel Hady
Abstract

Supervised machine learning is a branch of artificial intelligence concerned with building computer programs that automatically improve with experience by extracting knowledge from examples. It builds predictive models from labeled data. Such approaches are useful in many real-world applications, particularly tasks that involve the automatic categorization, retrieval and extraction of knowledge from large collections of data such as text, images and videos. In traditional supervised learning, one uses "labeled" data to build a model. However, labeling the training data for real-world applications is difficult, expensive, or time-consuming, as it requires the effort of human annotators, sometimes with specific domain experience and training. Obtaining these labels from domain experts carries implicit costs, such as limited time and financial resources. This is especially true for applications that involve a large number of class labels, sometimes with similarities among them. Semi-supervised learning (SSL) addresses this inherent bottleneck by allowing the model to integrate part or all of the available unlabeled data into its supervised learning. The goal is to maximize the learning performance of the model through such newly labeled examples while minimizing the work required of human annotators. Exploiting unlabeled data to improve learning performance has become a hot topic during the last decade, and the work is divided into four main directions: SSL with graphs, SSL with generative models, semi-supervised support vector machines, and SSL by disagreement (SSL with committees). It is interesting to see that semi-supervised learning and ensemble learning are two important paradigms that were developed almost in parallel and with different philosophies.
Semi-supervised learning tries to improve generalization performance by exploiting unlabeled data, while ensemble learning tries to achieve the same objective by using multiple predictors. In this thesis, I concentrate on SSL by disagreement, and especially on Co-Training style algorithms. Co-Training is a popular SSL algorithm introduced by


Similar resources

When Semi-supervised Learning Meets Ensemble Learning

Semi-supervised learning and ensemble learning are two important machine learning paradigms. The former attempts to achieve strong generalization by exploiting unlabeled data; the latter attempts to achieve strong generalization by using multiple learners. Although both paradigms have achieved great success during the past decade, they were developed almost separately. In this paper, we advocat...

Classifier Ensemble with Unlabeled Data

Ensemble learning aims to improve generalization ability by using multiple base learners. It is well known that, to construct a good ensemble, the base learners should be accurate as well as diverse. In this paper, unlabeled data is exploited to facilitate ensemble learning by helping augment the diversity among the base learners. Specifically, a semi-supervised ensemble method named Sealed is p...

Semi-Stacking for Semi-supervised Sentiment Classification

In this paper, we address semi-supervised sentiment learning via semi-stacking, which integrates two or more semi-supervised learning algorithms from an ensemble learning perspective. Specifically, we apply meta-learning to predict the unlabeled data given the outputs from the member algorithms and propose N-fold cross-validation to guarantee a suitable size of the data for training the meta-cla...

Regularized Boost for Semi-Supervised Learning

Semi-supervised inductive learning concerns how to learn a decision rule from a data set containing both labeled and unlabeled data. Several boosting algorithms have been extended to semi-supervised learning with various strategies. To our knowledge, however, none of them takes local smoothness constraints among data into account during ensemble learning. In this paper, we introduce a local smo...

Agreement/Disagreement Classification: Exploiting Unlabeled Data using Contrast Classifiers

Several semi-supervised learning methods have been proposed to leverage unlabeled data, but imbalanced class distributions in the data set can hurt the performance of most algorithms. In this paper, we adapt the new approach of contrast classifiers for semi-supervised learning. This enables us to exploit large amounts of unlabeled data with a skewed distribution. In experiments on a speech act ...


Publication date: 2010